One of the more common sources of pain when writing Python applications is the handling of string data, specifically when strings contain characters outside of common Latin characters.
One of the first standards developed for representing string data is known as ASCII, which stands for American Standard Code for Information Interchange. ASCII defines a mapping for representing common characters such as “A” through “Z” (in both upper- and lowercase), the digits “0” through “9,” and a few common symbols (such as the period, question mark, and so on).
However, ASCII relies on the assumption that each character maps to a single byte, and therefore runs into trouble: the world's languages contain far more characters than a single byte can represent. As a result, a standard known as Unicode is now used to represent text.
In Python, there are two different kinds of string data: text strings and byte strings. It is also possible to convert one type to the other. It is important to understand which kind of data you are dealing with, and to consistently keep the kinds of data straight.
In this chapter, you learn about the difference between text strings and byte strings, and how the types are implemented in both Python 2 and Python 3. You also learn how to deal with common problems that can pop up when you're working with string data within Python programs.
Data is consistently stored in bytes. Character sets such as ASCII and Unicode are responsible for using byte data to render the appropriate text.
ASCII's approach to this is straightforward. It defines a mapping table in which each character corresponds to 7 bits. A common superset of ASCII, latin-1 (discussed in more detail later), maintains this system, but uses 8 bits. Ordinarily, you represent bytes as either decimal or hexadecimal numbers. Therefore, whenever the ASCII codec encounters the byte represented by the decimal number 65 (or hex 0x41), it knows that this corresponds to the character A.
In fact, Python itself defines two functions for converting between a single integer byte and the corresponding character: ord and chr. The abbreviation “ord” stands for “ordinal.” The ord function takes a character and returns the integer corresponding to that character in the ASCII table, as shown here:
>>> ord('A')
65
The chr function does the opposite. It accepts an integer and returns the corresponding character on the ASCII table, as shown here:
>>> chr(65)
'A'
>>> chr(0x41)
'A'
The fundamental problem with ASCII is its assumption of a 1:1 mapping between characters and bytes. This is a serious limitation, because 256 characters is not nearly enough to include the various glyphs in different languages. Unicode solves this problem by using up to 4 bytes to represent each character.
The Python language actually has two different kinds of strings: one for storing text, and one for storing raw bytes. A text string stores data internally as Unicode, whereas a byte string stores raw bytes and displays them as ASCII (for example, when sent to print).
Adding to the confusion, Python 2 and Python 3 use different (but overlapping) names for their text strings and byte strings. The Python 3 terminology makes more sense, so you should learn it and then translate to Python 2 when working there.
In Python 3, the text string type (which stores Unicode data) is called str, and the byte string type is called bytes. Instantiating a string normally gives you a str instance, as shown here:
>>> text_str = 'The quick brown fox jumped over the lazy dogs.'
>>> type(text_str)
<class 'str'>
If you want a bytes instance, you prefix the literal with the b character.
>>> byte_str = b'The quick brown fox jumped over the lazy dogs.'
>>> type(byte_str)
<class 'bytes'>
It is possible to convert between a str and a bytes. The str class includes an encode method, which converts into a bytes using the specified codec. In most cases, you want to use UTF-8 as the codec when encoding data. The encode method takes a required argument, which is the string representing the appropriate codec.
>>> text_str.encode('utf-8')
b'The quick brown fox jumped over the lazy dogs.'
Similarly, the bytes class includes a decode method, which also takes the codec as a single, required argument, and returns a str. Decoding is a more interesting issue, though. It is insufficient to dogmatically say that you should always decode data as UTF-8, because data from another source may not have been encoded as UTF-8. You must decode data according to how it was encoded. You learn more about this later in this chapter.
Python 3 will never attempt to implicitly convert between a str and a bytes. Its approach is to require you to explicitly convert between text strings and byte strings with the str.encode and bytes.decode methods (a practice that requires you to specify a codec). For most applications, this is preferable behavior, because it helps you avoid situations where programs work when given common English text, but fail when running into unexpected characters.
This also means that text strings containing only ASCII characters are not considered to be equal to byte strings containing only ASCII characters.
>>> 'foo' == b'foo'
False
>>>
>>> d = {'foo': 'bar'}
>>> d[b'foo']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: b'foo'
Attempting nearly any operation on a text string and a byte string together will raise TypeError, as shown here:
>>> 'foo' + b'bar'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
One partial exception to this behavior is the % operator, which is used for string formatting in Python. Even here, attempting to interpolate a text string into a byte string raises TypeError, as expected.
>>> b'foo %s' % 'bar'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'str'
On the other hand, interpolating a byte string into a text string does work, but does not return the intuitively desired response.
>>> 'foo %s' % b'bar'
"foo b'bar'"
What is occurring here is that the operator takes the b'bar' value, which is a bytes object. It first looks for a __str__ method, which the bytes object does have. That method returns the text string "b'bar'", with the b prefix and quotes included. This is the same value returned by __repr__.
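You can verify this behavior directly in Python 3; this short sketch uses the same b'bar' value from the example above:

```python
byte_val = b'bar'

# str() on a bytes object returns its repr, b prefix and quotes included.
assert str(byte_val) == "b'bar'"
assert str(byte_val) == repr(byte_val)

# This is why interpolating bytes into a text string embeds the repr.
print('foo %s' % byte_val)  # foo b'bar'
```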
Python 2 strings mostly work similarly, but with some subtle but very important distinctions.
The first distinction is the names of the classes. The Python 3 str class is called unicode in Python 2. In and of itself, this is fine. However, the Python 3 bytes class is called str in Python 2. This means that a Python 3 str is a text string, whereas a Python 2 str is a byte string. If you are using Python 2, it is critically important to understand this distinction.
Instantiating a string with no prefix gives you a str instance (remember, this is a byte string!).
>>> byte_str = 'The quick brown fox jumped over the lazy dogs.'
>>> type(byte_str)
<type 'str'>
If you want a text string in Python 2, you prefix the string literal with the u character, as shown here:
>>> text_str = u'The quick brown fox jumped over the lazy dogs.'
>>> type(text_str)
<type 'unicode'>
Unlike Python 3, Python 2 does attempt to implicitly convert between text strings and byte strings. The way that this works is that if the interpreter encounters a mixed operation, it will first convert the byte string to a text string, and then perform the operation against the text strings.
It works this way so that an operation against a byte string and a text string will return a text string:
>>> 'foo' + u'bar'
u'foobar'
The interpreter performs this implicit decoding using whatever the default encoding is. On Python 2, this is almost always ASCII. Python defines a function, sys.getdefaultencoding, which provides the default codec used for implicitly converting between text strings and byte strings.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
This means that many of the previous Python 3 examples show distinctly different behavior in Python 2.
>>> 'foo' == u'foo'
True
>>>
>>> d = {u'foo': u'bar'}
>>> d['foo']
u'bar'
One somewhat bizarre aspect of Python 2's string-handling behavior is that text strings actually have a decode method, and byte strings actually have an encode method.
You never want to use these.
The theoretical purpose of these methods is to ensure that you don't have to worry too much about what the input variable is. Simply call encode to change either kind of string into a byte string, or decode to change either kind of string into a text string.
In practice, however, this can be both disastrous and very confusing, because if the method receives the “wrong” kind of input string (that is, a string already of the desired output type), it will attempt two conversions, performing the implicit one using ASCII.
Consider this Python 2 example:
>>> text_str = u'\u03b1 is for alpha.'
>>>
>>> text_str.encode('utf-8')
'\xce\xb1 is for alpha.'
>>>
>>> text_str.encode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0:
ordinal not in range(128)
It seems quite bizarre to ask to encode something as UTF-8 and to get back an error complaining that the text cannot be decoded as ASCII. But this is the implicit conversion that Python 2 attempts in order to run encode (a method intended for text strings) on a byte string.
To the interpreter, the final line is equivalent to the following:
text_str.encode('utf-8').decode('ascii').encode('utf-8')
That is never what you want.
It seems simple enough not to do this, but the way you encounter an error like this is usually not by bluntly running encode twice (as this example does), but rather by running encode or decode without first checking what kind of data you have. In Python 2, text strings and byte strings intermingle frequently, and it is very easy to get one when you expected the other.
If you are using Python 2.6 or greater, you can make part of this behavior track the Python 3 behavior if you choose to do so. Python defines a special module called __future__, from which you can preemptively opt in to future behavior.
In this case, importing unicode_literals causes string literals to follow the Python 3 convention, although the Python 2 class names are still used.
>>> from __future__ import unicode_literals
>>> text_str = 'The quick brown fox jumped over the lazy dogs.'
>>> type(text_str)
<type 'unicode'>
>>> bytes_str = b'The quick brown fox jumped over the lazy dogs.'
>>> type(bytes_str)
<type 'str'>
Once from __future__ import unicode_literals is invoked, a string literal with no prefix in Python 2.6 or greater becomes a text string (unicode), and a b prefix creates a byte string (a Python 2 str).
Doing this does not forward-port other aspects of Python 2's string handling to the Python 3 behavior. The interpreter will still attempt to implicitly convert between text strings and byte strings, and ASCII is still the default encoding.
Nonetheless, most strings specified in code are intended to be text strings rather than byte strings. Therefore, if you are writing code that does not need to support versions of Python below Python 2.6, it is very wise to use this.
The fact that Python 2 and Python 3 provide different class names for text strings and byte strings can be a source of confusion, although the transition to the much clearer Python 3 nomenclature is an important one.
To help cope with this, the popular Python library six, which is centered around writing modules that run correctly in both Python 2 and Python 3 (and which is covered in much more detail in Chapter 10, “Python 2 Versus Python 3”), provides aliases for these types so that they can be consistently referenced in code that must run on both platforms. The class for text strings (str in Python 3 and unicode in Python 2) is aliased as six.text_type, whereas the class for byte strings (bytes in Python 3 and str in Python 2) is aliased as six.binary_type.
Most Python programs, and nearly any program that handles user input (whether direct input, from a file, from a database, and so on), must be able to handle arbitrary characters, including those not found on the ASCII table. Converting ASCII characters between text strings and byte strings is trivial (in the utf-8 codec, it is actually a no-op). The complexity arrives when non-ASCII characters are in play, especially if text strings and byte strings are being used without sufficient regard to which is which.
Consider a text string that contains non-ASCII characters, such as the text string in the following code, which says “Hello, world” Google-translated into Greek (note that this is Python 3 code):
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> type(text_str)
<class 'str'>
The first thing to note about this text string is that it cannot be encoded to a bytes instance using the ascii codec at all.
>>> text_str.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
This is because ASCII does not include Greek characters, so the ASCII codec has no way to translate them into raw byte data. This is fine, though, because that is what the utf-8 codec is for, as shown here:
>>> text_str.encode('utf-8')
b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xb1\xcf\x82,
\xcf\x84\xce\xbf\xce\xbd \xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xbf.'
Several things are worth noting at this point. First and foremost, this is the first string you have encountered where the text string and the byte string look substantially different. The repr of the text string looks like human-readable Greek, whereas the repr of the byte string looks like it is intended to be machine-readable.
Also, notice that the lengths of the strings are actually not the same.
>>> byte_str = text_str.encode('utf-8')
>>> len(text_str)
20
>>> len(byte_str)
35
Why is this? Remember the problem that Unicode exists to solve: ASCII assumes a 1:1 correlation between bytes and characters, which puts a substantial limitation on the number of characters available.
Unicode allows for many more characters to exist by breaking out of this limitation. UTF-8 characters are variable length. A single Unicode character may be as small as a single byte (for the characters on the ASCII table), or as large as 4 bytes.
In the case of the example Greek text, most characters are 2 bytes, which is why the len of the byte string is almost double the len of the text string. However, the spaces, period, and comma (visible as such in the byte string) are all ASCII characters, and only take 1 byte each.
Why do the text strings and byte strings that only contain ASCII characters look so similar when printed, but the Unicode strings look so different?
By convention, bytes in the ASCII range are displayed as their ASCII characters. Additionally, Unicode is structured in such a way as to make it an exact superset of ASCII. This means that the characters in the Latin alphabet, as well as the common punctuation symbols, are represented the same way in text strings as in byte strings.
This has another important meaning. Any valid ASCII text is also valid Unicode text.
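This superset relationship is easy to demonstrate; the following sketch shows ASCII-only data passing through both codecs unchanged:

```python
ascii_bytes = b'Hello, world.'

# Pure ASCII bytes decode identically under either codec...
assert ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')

# ...and ASCII-only text encodes to the same bytes either way.
assert 'Hello, world.'.encode('utf-8') == ascii_bytes
print('ASCII round-trips unchanged through UTF-8.')
```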
Unicode is not the only encoding available to convert between raw byte data and a readable textual representation. Many others have been put forward, and some are in common use.
One common encoding is formally known as the ISO-8859-1 standard, and colloquially called latin-1. (For clarity, the remainder of this chapter uses “Latin-1” to refer to it rather than ISO-8859-1.)
Like Unicode, this encoding is a superset of ASCII, and adds support for glyphs found in many different languages other than English. However, as its name suggests, it is designed only to support languages that rely on Latin glyphs for their letters, and is not suitable for rendering languages that use other alphabets (such as Greek, Chinese, Japanese, Russian, or Korean, among others).
It would not actually be possible to render the previous Greek string using the latin-1 codec, as the following Python 3 example demonstrates:
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> text_str.encode('latin-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-
3: ordinal not in range(256)
It is important to recognize that while many encodings are structured as supersets of ASCII, they are often not compatible with one another. Outside of ASCII, there is little or no overlap between the latin-1 and utf-8 codecs.
Consider the difference in byte strings encoded using each codec.
>>> text_str = 'El zorro marrón rápido saltó por encima ' + \
... 'de los perros vagos.'
>>> text_str.encode('utf-8')
b'El zorro marr\xc3\xb3n r\xc3\xa1pido salt\xc3\xb3 por encima de los
perros vagos.'
>>> text_str.encode('latin-1')
b'El zorro marr\xf3n r\xe1pido salt\xf3 por encima de los perros vagos.'
Because of this, a string encoded using one codec cannot be decoded using the other. If you try to take a byte string representing text encoded using latin-1 and decode it as utf-8, the Unicode codec will realize that it is encountering an invalid character sequence and fail.
>>> text_str.encode('latin-1').decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 13:
invalid continuation byte
Worse, if you try to take a byte string representing text encoded with utf-8 and decode it as latin-1, the (more permissive) codec will successfully return a text string, but with garbled text.
>>> text_str.encode('utf-8').decode('latin-1')
'El zorro marrón rápido saltó por encima de los perros vagos.'
It is impossible to infer based on the content of a byte string what encoding is in use. However, many common document formats and data-transfer protocols provide a mechanism to declare what encoding is in use. On the other hand, it is also possible that a document will incorrectly specify its character encoding.
Files always store bytes. Therefore, to use textual data read in from files, you must decode it into a text string.
In Python 3, files are ordinarily decoded automatically for you. Consider the following file with Unicode text, encoded using UTF-8:
Hello, world.
Γεια σας, τον κόσμο.
Opening and reading this file in Python 3 gives you a text string (not a byte string).
>>> with open('unicode.txt', 'r') as f:
... text_str = f.read()
...
>>> type(text_str)
<class 'str'>
This code example is making a few critical assumptions that are important to understand.
The biggest assumption being made is how to decode the file. Text files do not declare how they are encoded. There is no way for the interpreter to know whether it is getting UTF-8 text, Latin-1 text, or something else entirely.
Python 3 decides which encoding should be used based on what kind of system it is running on. A function is available to expose this: locale.getpreferredencoding(). On Mac OS X and on most Linux systems, the preferred encoding is UTF-8.
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
However, most Windows systems use a different encoding called Windows-1252 or CP-1252 to encode text files, and running the same code in Python 3 on Windows reflects this.
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
It is important to note explicitly that the preferred encoding that locale.getpreferredencoding() provides is based on how the underlying system operates. It is reflective, not prescriptive. A text file with special characters saved on almost any system (using that system's default tools) and then opened using open in Python 3 will probably be decoded correctly.
However, files are not opened solely on the same type of system on which they are created. This is where the assumption becomes problematic.
Python 3 enables you to explicitly declare the encoding of a file by providing an optional encoding keyword argument to open. This argument accepts a codec, specified as a string, similar to encode and decode.
Because the example Unicode file is stipulated as being encoded using UTF-8, you can explicitly tell the interpreter to decode it as such.
>>> with open('unicode.txt', 'r', encoding='utf-8') as f:
... text_str = f.read()
...
>>> type(text_str)
<class 'str'>
Because the file was encoded as UTF-8, and the UTF-8 codec was used to decode it, the text string contains the expected data.
>>> text_str
'Hello, world.\nΓεια σας, τον κόσμο.\n'
Another implicit assumption being made (which logically precedes which codec to use to decode the file) is that the file should be decoded at all.
You may want to read in the file as a byte string instead of as a text string. There are two common reasons to do this. The most common reason is if you are accepting non-textual data (for example, if you are reading in an image). However, another potential reason is for reading text files with an uncertain encoding.
To read in a byte string instead of a text string, add the character b to the mode string (the second argument) sent to open. For example, consider reading in the same file containing Unicode as a byte string, as shown here:
>>> with open('unicode.txt', 'rb') as f:
... byte_str = f.read()
...
>>> type(byte_str)
<class 'bytes'>
Examining the byte_str variable shows the raw bytes in the string for the second line of text.
>>> byte_str
b'Hello, world.\n\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xb1\xcf\x82,
\xcf\x84\xce\xbf\xce\xbd \xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xbf.\n'
This variable can be decoded just as if it were a byte string provided from any other source.
>>> byte_str.decode('utf-8')
'Hello, world.\nΓεια σας, τον κόσμο.\n'
This can be a useful strategy for dealing with a file whose encoding is uncertain. The data can be safely read from the file as bytes, and then the program can attempt to determine programmatically how to decode it.
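One way to sketch that strategy is a small helper function that tries a list of candidate codecs in order. The function name and the candidate list here are illustrative assumptions, not a standard recipe:

```python
def decode_with_fallbacks(byte_str, candidates=('utf-8', 'latin-1')):
    """Return the first successful decoding from the candidate codecs.

    Try strict codecs (such as utf-8) before permissive ones (such as
    latin-1), because a permissive codec will "succeed" on almost any
    input, right or wrong.
    """
    for codec in candidates:
        try:
            return byte_str.decode(codec)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate codecs could decode the data')

# UTF-8 data decodes on the first attempt...
print(decode_with_fallbacks('κόσμο'.encode('utf-8')))     # κόσμο
# ...while latin-1 data fails the UTF-8 attempt and falls through.
print(decode_with_fallbacks('marrón'.encode('latin-1')))  # marrón
```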
In Python 2, the read method will always return a byte string, regardless of how the file was opened.
>>> with open('unicode.txt', 'r') as f:
... byte_str = f.read()
...
>>> type(byte_str)
<type 'str'>
Note that the b modifier was not used in the second argument to open, but a str instance (which is a byte string in Python 2) was returned anyway.
You can get a text string by using decode, just like on a byte string that comes from any other source.
>>> byte_str
'Hello, world.\n\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xb1\xcf\x82,
\xcf\x84\xce\xbf\xce\xbd \xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xbf.\n'
>>>
>>> byte_str.decode('utf-8')
u'Hello, world.\n\u0393\u03b5\u03b9\u03b1 \u03c3\u03b1\u03c2,
\u03c4\u03bf\u03bd \u03ba\u03cc\u03c3\u03bc\u03bf.\n'
Because Python 2 always provides byte strings, the open function does not have an encoding keyword argument, and attempting to provide one will raise TypeError.
If you are writing code that is intended to run on Python 2, the best and safest approach is to always open files in binary mode (using b) and, if you are expecting textual data, decode it yourself.
Textual data is read from many different places, not only from files. Modern programs receive direct user input, accept input over protocols (such as HTTP), read out of databases, and transfer data using serialization formats such as Extensible Markup Language (XML) or JavaScript Object Notation (JSON).
Python provides many libraries and tools for reading data of many types, and from many sources. For example, the json module available in Python 2.6 and later is able to serialize and deserialize JSON data. Furthermore, numerous third-party packages are available that read data from other types or sources. For example, the pyyaml library reads YAML files, and the psycopg2 library reads and writes data from PostgreSQL databases.
Most (but not all) of these libraries return text strings. However, it is your responsibility to familiarize yourself with the libraries you use and to know whether you are getting text strings or byte strings. Also, some libraries may behave differently on different versions of Python, returning byte strings on Python 2 and text strings on Python 3. It is very important to make sure you keep them straight!
Many document formats do provide a means to declare what codec is being used to encode text. For example, an XML file may begin like this:
<?xml version="1.0" encoding="UTF-8"?>
This is a common way to begin an XML file. Pay attention to the encoding attribute. It declares that textual data in the file is encoded using UTF-8. Because the XML file declares its encoding, programs that read XML will use UTF-8 to convert any text they find from bytes to text.
Sometimes it is necessary for Python source files to declare an encoding. For example, suppose a Python source file includes a string literal containing Unicode characters. On Python 2, the interpreter assumes that Python source files are encoded using ASCII, and this will actually fail.
Consider the following Python module saved as unicode.py:
text_str = u'Γεια σας, τον κόσμο.'
print(text_str)
Running this module in Python 3.3 or greater (Python 3.0-3.2 lack the u prefix) works without any issues.
$ python3.4 unicode.py
Γεια σας, τον κόσμο.
However, running the same module in Python 2 will fail with a syntax error on the first line, because the Python 2 interpreter wants ASCII.
$ python2.7 unicode.py
File "unicode.py", line 1
SyntaxError: Non-ASCII character '\xce' in file unicode.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html
for details
As the error message suggests, Python modules actually can declare an encoding, similar to how an XML file might do so. By default, Python 2 expects files to be encoded as ASCII, and Python 3 expects files to be encoded as UTF-8.
To override this, Python enables you to include a comment at the top of a module, formatted in a particular way. The interpreter will read this comment and use it as an encoding declaration.
The format for specifying the encoding for a Python file is as follows:
# -*- coding: utf-8 -*-
You can use any codec here that can be passed to encode and decode. So, values such as ascii, latin-1, and cp1252 are all acceptable (assuming, of course, that the file is actually encoded that way).
Consider the same module with a coding declaration:
# -*- coding: utf-8 -*-
text_str = u'Γεια σας, τον κόσμο.'
print(text_str)
If you run this modified file under Python 2, it will now succeed instead of raising a syntax error.
$ python2.7 unicode.py
Γεια σας, τον κόσμο.
Note that, if you choose to manually specify an encoding for a Python module, it is your responsibility to ensure that the encoding you specify is actually correct. Like any other document format, Python modules are not exempt from the possibility of declaring one encoding while actually using another.
If you accidentally specify the wrong encoding, your strings will come out as garbage. Consider what happens if the same file is declared to be encoded using latin-1 (when it actually contains utf-8 data).
# -*- coding: latin-1 -*-
text_str = u'Γεια σας, τον κόσμο.'
print(text_str)
Running this in either Python 2 or Python 3.3+ will produce the same result, which is complete garbage.
$ python3.4 unicode.py
Î“ÎµÎ¹Î± ÏƒÎ±Ï‚, Ï„Î¿Î½ ÎºÏŒÏƒÎ¼Î¿.
Because the latin-1 codec can accept almost any byte stream, it does not recognize that this is not latin-1-encoded text, and cheerfully returns bad data. Some codecs (such as utf-8) are more strict, in which case you would get an exception instead. The latter situation is preferable, but neither is what you want. It is critical to declare encodings correctly.
Note also that this is dependent on your terminal's capability to display these characters. If you have a terminal that does not support Unicode, this will likely raise an exception.
One key advantage of utf-8 as a codec is that, in addition to supporting the entire range of Unicode characters, it is also a “strict” codec. This means that it does not just take any byte stream and decode it. It can usually detect that non-UTF-8 byte streams are invalid and fail.
This can lead to helpful patterns when you're dealing with a byte stream whose encoding is not known (because there is no way to infer the encoding with certainty). For example, if you think that a byte stream might be utf-8 and might be latin-1, you can try both, as shown here:
try:
    text_str = byte_str.decode('utf-8')
except UnicodeDecodeError:
    text_str = byte_str.decode('latin-1')
Of course, this is not a panacea. What happens, for example, if you get a byte string encoded as something else entirely? Because latin-1 is a permissive codec, it will decode it, but incorrectly.
Sometimes when you are decoding or encoding text using strict codecs (such as utf-8 or ascii), you do not want an exception when the codec encounters text that it does not know how to handle.
The encode and decode methods provide a mechanism to ask a codec to behave differently when it encounters a set of characters that it cannot handle. Both methods take an optional second argument, errors, specified as a string. The default value is strict, which is what raises exception classes such as UnicodeDecodeError. The two other common error handlers are ignore and replace.
The ignore error handler simply skips over any bytes that the codec does not know how to decode. Consider what happens if you attempt to decode your Greek text as ASCII, as shown here:
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> byte_str = text_str.encode('utf-8')
>>> byte_str.decode('ascii', 'ignore')
' , .'
The ASCII codec does not know how to handle any of the Greek characters, but it does know how to handle the spaces and punctuation. Therefore, it preserves those, but strips all of the foreign characters.
The replace error handler is similar, but instead of skipping over unrecognized characters, it replaces them with a placeholder character. The exact placeholder varies slightly based on the situation (whether encoding or decoding, and what codec is in use), but is usually either a question mark (?) or the special Unicode question-mark diamond character (�).
Here is the result if you try to decode your Greek text using the ascii codec and the replace error handler:
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> byte_str = text_str.encode('utf-8')
>>> byte_str.decode('ascii', 'replace')
'�������� ������, ������ ����������.'
And here is the result if you try to encode your Greek text to a byte string using the ascii codec and the replace error handler:
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> text_str.encode('ascii', 'replace')
b'???? ???, ??? ?????.'
You may notice that when using the replace error handler, the number of replacement characters may not be 1:1 with the number of characters in the actual text string. When decoding a byte string using the ascii codec, the codec has no way of knowing how many bytes correspond to each character, so it ends up showing more replacement characters than there are actual characters in the text string.
It is possible to register additional error handlers if the built-in ones are insufficient. The codecs module (where the default error handlers are defined) exposes a function for registering additional error handlers, named register_error. It takes two arguments: the name for the error handler and the actual function that does the error handling.
That function receives the exception that would otherwise be raised, and is responsible for re-raising it, raising another exception, or returning an appropriate string value to be substituted into the resulting string.
The exception instance contains start and end attributes that correspond to the substring that the codec is unable to encode or decode. It also has a reason attribute with a human-readable explanation of why it is unable to encode or decode the characters in question, and an object attribute with the original string.
If returning a replacement value, the error function must return a tuple with two elements. The first element is the replacement character or characters, and the second is the position in the original string where encoding or decoding should continue. Usually, this corresponds to the end attribute on the exception instance. Be careful with the position you return; it is very easy to get into an infinite loop.
The following example simply replaces characters with a different substitution character:
import codecs

def replace_with_underscore(err):
    length = err.end - err.start
    return ('_' * length, err.end)

codecs.register_error('replace_with_underscore', replace_with_underscore)
This error handler replaces unknown characters, but with underscores rather than question marks. The following is what happens if you decode a byte string with Unicode Greek text using the ascii codec and this error handler:
>>> text_str = 'Γεια σας, τον κόσμο.'
>>> byte_str = text_str.encode('utf-8')
>>> byte_str.decode('ascii', 'replace_with_underscore')
'________ ______, ______ __________.'
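The same handler works in the encoding direction as well. The following sketch repeats the registration so that it stands alone:

```python
import codecs

def replace_with_underscore(err):
    # One underscore per character (or byte) in the failing range,
    # resuming immediately after that range.
    length = err.end - err.start
    return ('_' * length, err.end)

codecs.register_error('replace_with_underscore', replace_with_underscore)

# When encoding, each unencodable character becomes one underscore.
print('Γεια σας.'.encode('ascii', 'replace_with_underscore'))  # b'____ ___.'
```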
Handling string data can be surprisingly frustrating. It is easier than you might expect to create a program that works right up until it encounters textual data that is dissimilar to what it expected.
When possible, try to have as much of your program as possible handle text strings. It is a good idea to decode byte strings as soon as possible after you receive them. Similarly, when writing data out, endeavor to encode your text strings to byte strings as late as possible.
Sometimes decoding is difficult. You may not know how a byte string is encoded, or you may be told an encoding, but be told wrong. This is challenging, and there is no easy solution.
Remember, the Python interpreter is your friend here. If you are dealing with problematic data, and you do not know the encoding, you may be able to interactively decode a sample of it using different codecs until you find something that looks reasonable. Of course, this manual approach assumes that the data you are coding for will always be similar to the sample data you are using.
The key thing to remember when handling string data is to ensure that you always know what kind of string you are dealing with. The worst and most frustrating problems crop up when you expect a text string and receive a byte string, or vice versa. Be sure to keep them straight.
Chapter 9 explores regular expressions, which are a mechanism for searching strings for data that matches a given pattern.